A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction
Abstract
Statistical methods for extracting Chinese unknown words often also extract superfluous character strings that merely have strong statistical associations. To solve this problem, this paper proposes a set of general morphological rules to broaden coverage; the rules are augmented with linguistic and statistical constraints to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed. It merges possible morphemes recursively by consulting these general rules, and it dynamically decides which rule to apply first according to the rules' priorities. The effects of different priority strategies are compared experimentally, and the results indicate that the proposed method is promising.
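The priority-driven merging described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the `Rule` interface, the `TOY_ASSOC` scores, the 0.5 threshold, and the function names are all assumptions made for illustration, and the real rules carry linguistic and statistical constraints that are not modeled here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical rule interface: a rule looks at two adjacent units and
# returns a priority score, or None when it does not apply.
Rule = Callable[[str, str], Optional[float]]

# Toy association scores between adjacent units (made-up values).
TOY_ASSOC: Dict[Tuple[str, str], float] = {
    ("a", "b"): 0.9,
    ("ab", "c"): 0.7,
    ("c", "d"): 0.8,
}


def assoc_rule(left: str, right: str) -> Optional[float]:
    """Apply only when the toy association score reaches 0.5 (assumed threshold)."""
    score = TOY_ASSOC.get((left, right))
    return score if score is not None and score >= 0.5 else None


def bottom_up_merge(units: List[str], rules: List[Rule]) -> List[str]:
    """Repeatedly merge the adjacent pair whose applicable rule has the
    highest priority, until no rule applies to any adjacent pair."""
    units = list(units)  # work on a copy
    while True:
        best: Optional[Tuple[float, int]] = None
        for i in range(len(units) - 1):
            for rule in rules:
                score = rule(units[i], units[i + 1])
                # Strict '>' keeps the leftmost pair on ties.
                if score is not None and (best is None or score > best[0]):
                    best = (score, i)
        if best is None:
            return units
        _, i = best
        # Replace the two units with their concatenation.
        units[i:i + 2] = [units[i] + units[i + 1]]


result = bottom_up_merge(["a", "b", "c", "d"], [assoc_rule])
print(result)
```

In this toy run the pair `a`+`b` is merged first, then `c`+`d` outranks `ab`+`c`, so the result is `ab`, `cd`. A plain left-to-right greedy pass would instead have produced `abc`, `d`, which is the kind of difference a priority strategy is meant to control.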
Related papers
Towards a Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...
Chinese Unknown Word Extraction by Mining Maximized Substrings
The issue of identifying out-of-vocabulary (OOV) words is a major difficulty in Chinese word segmentation. We address this issue by applying a very efficient algorithm for extracting maximized substrings (Shen et al., 2013) from a large-scale raw text, which form a list of unknown word candidates. We then apply techniques such as Short-term Store and Lexicon-based Voting to reduce the noises in...
Cascade Markov random fields for stroke extraction of Chinese characters
Extracting perceptually meaningful strokes plays an essential role in modeling structures of handwritten Chinese characters for accurate character recognition. This paper proposes a cascade Markov random field (MRF) model that combines both bottom-up (BU) and top-down (TD) processes for stroke extraction. In the low-level stroke segmentation proce...
A Fast Algorithm of Address Lines Extraction on Complex Chinese Mail Pieces
A fast and efficient method is presented to extract address lines on both machine-printed and handwritten Chinese mail envelopes. The algorithm is based on a bottom-up approach. First, we select text blocks from connected components (CCs) and immediately group the text blocks into initial lines. Then, the average text block features are computed to validate the initial text lines and gu...
A Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid model that combines machine learning with linguistic and statistical heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two major components: a tagging component that annotates each character in a Chinese sentence with a position-of-character (POC) tag that indicates its position in a word, and a merging com...
Publication date: 2003